Optimal Multi-Paragraph Text Segmentation by Dynamic Programming 

Oskari Heinonen 

University of Helsinki, Department of Computer Science 
P.O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland 

Oskari.Heinonen@cs.Helsinki.FI






Abstract 

There exist several methods of calculating a similarity
curve, or a sequence of similarity values, representing
the lexical cohesion of successive text constituents,
e.g., paragraphs. Methods for deciding the locations of
fragment boundaries are, however, scarce. We propose a
fragmentation method based on dynamic programming. The
method is theoretically sound and guaranteed to provide
an optimal splitting on the basis of a similarity curve,
a preferred fragment length, and a defined cost function.
The method is especially useful when control over
fragment size is important.

1 Introduction 

Electronic full-text documents and digital libraries 
make the utilization of texts much more effective 
than before; yet, they pose new problems and re- 
quirements. For example, document retrieval based 
on string searches typically returns either the whole 
document or just the occurrences of the searched 
words. What the user is often after, however, is a
microdocument: a part of the document that contains
the occurrences and is reasonably self-contained.

Microdocuments can be created by utilizing the lexical
cohesion (term repetition and semantic relations)
present in the text. There exist several methods of
calculating a similarity curve, or a sequence of
similarity values, representing the lexical cohesion of
successive constituents (such as paragraphs) of text
(see, e.g., (Hearst, 1994; Hearst, 1997; Kozima, 1993;
Morris and Hirst, 1991; Yaari, 1997; Youmans, 1991)).
Methods for deciding the locations of fragment
boundaries are, however, less common, and those that
exist are often rather heuristic in nature.

To evaluate our fragmentation method, to be explained in
Section 2, we calculate the paragraph similarities as
follows. We employ stemming, remove stopwords, and count
the frequencies of the remaining words, i.e., terms. Then
we take a predefined number, e.g., 50, of the most
frequent terms to represent the paragraph, and compute
the similarity using the cosine coefficient (see, e.g.,
(Salton, 1989)). Furthermore, we have applied a sliding
window method: instead of just one paragraph, several
paragraphs on both sides of each paragraph boundary are
considered. The paragraph vectors are weighted based on
their distance from the boundary in question, with the
immediately adjacent paragraphs having the highest
weight. The benefit of using a larger window is that we
can smooth the effect of short paragraphs and of
paragraphs, perhaps example-type ones, that interrupt a
chain of coherent paragraphs.
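
The similarity calculation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the paragraphs are assumed to be already stemmed and stopword-filtered token lists, the function names are hypothetical, and the 1/distance weighting is an assumed scheme, since the text only says that nearer paragraphs receive a higher weight.

```python
import math
from collections import Counter


def top_terms(tokens, k=50):
    """Keep the k most frequent terms of a paragraph as a sparse vector."""
    return dict(Counter(tokens).most_common(k))


def cosine(u, v):
    """Cosine coefficient of two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def window_vector(vectors, weights):
    """Weighted sum of paragraph vectors; weights[i] applies to vectors[i]."""
    total = Counter()
    for vec, weight in zip(vectors, weights):
        for term, freq in vec.items():
            total[term] += weight * freq
    return dict(total)


def similarity_curve(paragraphs, window=2, k=50):
    """Similarity at each of the n - 1 boundaries between token lists.

    The weight 1 / distance is an assumption made for this sketch.
    """
    vecs = [top_terms(p, k) for p in paragraphs]
    weights = [1.0 / (d + 1) for d in range(window)]
    sims = []
    for b in range(1, len(vecs)):  # boundary between paragraphs b-1 and b
        left = vecs[max(0, b - window):b][::-1]  # nearest paragraph first
        right = vecs[b:b + window]               # nearest paragraph first
        sims.append(cosine(window_vector(left, weights),
                           window_vector(right, weights)))
    return sims
```

With `window=1` the sketch reduces to comparing only the two paragraphs adjacent to each boundary; larger windows smooth the curve as described above.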

2 Fragmentation by Dynamic Programming

Fragmentation is a problem of choosing the para- 
graph boundaries that make the best fragment 
boundaries. The local minima of the similarity 
curve are the points of low lexical cohesion and thus 
the natural candidates. To get reasonably-sized mi- 
crodocuments, the similarity information alone is 
not enough; also the lengths of the created frag- 
ments have to be considered. In this section, we de- 
scribe an approach that performs the fragmentation 
by using both the similarities and the length infor- 
mation in a robust manner. The method is based on 
a programming paradigm called dynamic programming
(see, e.g., (Cormen et al., 1990)). Dynamic programming
guarantees the optimality of the result with respect to
the input and the parameters.

The idea of the fragmentation algorithm is as follows
(see also Fig. 1). We start from the first boundary and
calculate a cost for it as if the first paragraph were a
single fragment. Then we take the second boundary and
attach to it the minimum of the two available
possibilities: the cost of the first two paragraphs as if
they were a single fragment and the cost



fragmentation(n, p, h, len[1..n], sim[1..n-1])
/* n no. of pars, p preferred frag length, h scaling */
/* len[1..n] par lengths, sim[1..n-1] similarities */
{
    sim[0] := 0.0; cost[0] := 0.0; B := ∅;
    for par := 1 to n {
        len_sum := 0; /* cumulative fragment length */
        c_min := MAXREAL;
        for i := par to 1 {
            len_sum := len_sum + len[i];
            c := c_len(len_sum, p, h);
            if c > c_min { /* optimization */
                exit the innermost for loop;
            }
            c := c + cost[i-1] + sim[i-1];
            if c < c_min {
                c_min := c; loc_c_min := i - 1;
            }
        }
        cost[par] := c_min; link_prev[par] := loc_c_min;
    }
    j := n;
    while link_prev[j] > 0 {
        B := B ∪ link_prev[j]; j := link_prev[j];
    }
    return(B); /* set of chosen fragment boundaries */
}

Figure 1: The dynamic programming algorithm for
fragment boundary detection.

of the second paragraph as a separate fragment. In the
following steps, the evaluation advances one paragraph
at a time, and all the possible locations of the previous
breakpoint are considered. We continue this procedure
until the end of the text, and finally we can generate a
list of breakpoints that indicate the fragmentation.
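
The procedure of Figure 1 can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the 0-based list handling, the choice to return boundaries in increasing order, and the quadratic length cost in the example are assumptions made for illustration.

```python
import math


def fragmentation(lengths, sims, p, h, c_len):
    """Return the chosen fragment boundaries in increasing order.

    lengths[i] is the length of paragraph i + 1, and sims[i] is the
    similarity at boundary i + 1 (between paragraphs i + 1 and i + 2).
    Boundary k lies after paragraph k; boundary 0 is the start of the text.
    """
    n = len(lengths)
    sim = [0.0] + list(sims)       # sim[0] = 0.0, as in Figure 1
    cost = [0.0] * (n + 1)         # cost[k]: best cost of the text up to boundary k
    link_prev = [0] * (n + 1)      # previous boundary on the best path
    for par in range(1, n + 1):
        len_sum = 0
        c_min = math.inf
        loc_c_min = par - 1
        for i in range(par, 0, -1):    # candidate fragment: paragraphs i..par
            len_sum += lengths[i - 1]
            c = c_len(len_sum, p, h)
            if c > c_min:              # pruning step from Figure 1
                break
            c += cost[i - 1] + sim[i - 1]
            if c < c_min:
                c_min, loc_c_min = c, i - 1
        cost[par] = c_min
        link_prev[par] = loc_c_min
    boundaries = []
    j = n
    while link_prev[j] > 0:            # follow the links back to the start
        boundaries.append(link_prev[j])
        j = link_prev[j]
    return sorted(boundaries)


# Example with an assumed quadratic length cost (illustrative only).
quadratic = lambda x, p, h: h * (x / p - 1.0) ** 2
print(fragmentation([100, 100, 100, 100], [0.9, 0.1, 0.9], 200, 1.0, quadratic))  # → [2]
```

In the example, four 100-word paragraphs with a preferred length of 200 words are split at boundary 2, which is also the point of lowest similarity. Each boundary looks back over at most all earlier paragraphs, so the sketch takes O(n^2) time in the worst case; the pruning step usually ends the inner loop much earlier.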

The cost at each boundary is a combination of three
components: the cost of fragment length c_len, and the
cost cost[·] and similarity sim[·] of some previous
boundary. The cost function c_len gives the lowest cost
for the preferred fragment length given by the user,
e.g., 500 words. A fragment that is either shorter or
longer gets a higher cost, i.e., is penalized for its
length. We have experimented with two families of cost
functions c_len(x, p, h): second-degree functions
(parabolas) and linear functions,
Figure 2: Similarity curve and detected fragment
boundaries with different cost functions: (a) linear,
(b) parabola. p is 600 words in both (a) and (b).
"H0.25", etc., indicates the value of h. Vertical bars
indicate fragment boundaries, while short bars below the
horizontal axis indicate paragraph boundaries.

where x is the actual fragment length, p is the pre- 
ferred fragment length given by the user, and h is a 
scaling parameter that allows us to adjust the weight 
given to fragment length. The smaller the value of 
h, the less weight is given to the preferred fragment 
length in comparison with the similarity measure. 
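
To make the role of p and h concrete, here is a small Python sketch of one parabola and one linear length cost. The exact formulas are not reproduced in this text, so the two forms below are assumptions chosen only to satisfy the stated properties: the cost is lowest at x = p, grows as x moves away from p in either direction, and scales with h.

```python
def parabola_cost(x, p, h):
    """Second-degree length cost (assumed form): h * (x/p - 1)^2."""
    return h * (x / p - 1.0) ** 2


def linear_cost(x, p, h):
    """Linear, V-shaped length cost (assumed form): h * |x/p - 1|."""
    return h * abs(x / p - 1.0)
```

Under these assumed forms, with p = 600 and h = 1.0, a 300-word fragment costs 0.5 with the linear function and 0.25 with the parabola, so near p the parabola penalizes deviations more gently. Halving h halves both costs, which gives the similarity values more influence on the chosen boundaries.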

3 Experiments 

As test data we used Mars by Percival Lowell (1895). As
an illustrative example, we present the analysis of
Section I, "Evidence of it," of Chapter II, "Atmosphere."
The section is approximately 6600 words long and
contains 55 paragraphs. The fragments found with
different parameter settings can be seen in Figure 2. One
of the most interesting is the one with the parabola cost
function and h = .5. In this case the fragment length
adjusts nicely according to the similarity curve. Looking
at the text, most fragments have an easily identifiable
topic, such as atmospheric chemistry in fragment 7.
Fragments 2 and 3 seem to have roughly the same topic:
measuring the diameter of the planet Mars. The fact that
they do not form a single fragment can be explained by
the preferred fragment length requirement.

Table 1: Variation of fragment length. Columns: l_avg,
l_min, l_max: average, minimum, and maximum fragment
length; d_avg: average deviation.

Table 1 summarizes the effect of the scaling factor h on
the fragment length variation with the two cost
functions over those 8 sections of Mars that have a
length of at least 20 paragraphs. The average deviation
d_avg with respect to the preferred fragment length p is
defined as $d_{avg} = (\sum_{i=1}^{m} |p - l_i|)/m$,
where l_i is the length of fragment i, and m is the
number of fragments. The choice of parametric cost
function affects the result considerably. As expected,
the second-degree cost function allows more variation
than the linear one, but the roles are reversed with a
small h. Although the experiment is too limited to
support general conclusions, we can see that in this
example a factor h > 1.0 is unsuitable with the linear
cost function (and h = 1.5 with the parabola), since in
these cases so much weight is given to the fragment
length that fragment boundaries can appear very close to
quite strong local maxima of the similarity curve.

4 Conclusions

In this article, we presented a method for detecting
fragment boundaries in text. The fragmentation method is
based on dynamic programming and is guaranteed to give
an optimal solution with respect to a similarity curve, a
preferred fragment length, and a defined parametric
fragment-length cost function. The method is independent
of the similarity calculation. This means that any
method, not necessarily based on lexical cohesion, that
produces a suitable sequence of similarities can be used
prior to our fragmentation method. For example, the
lexical cohesion profile (Kozima, 1993) should be
perfectly usable with our fragmentation method.

The method is especially useful when control over
fragment size is required. This is the case in passage
retrieval, since windows of 1000 bytes (Wilkinson and
Zobel, 1995) or some hundred words (Callan, 1994) have
been proposed as best passage sizes. Furthermore, we
believe that fragments of reasonably similar size are
beneficial in our intended purpose of document assembly.

Acknowledgements 

This work has been supported by the Finnish 
Technology Development Centre (TEKES) together 
with industrial partners, and by a grant from the 
350th Anniversary Foundation of the University 
of Helsinki. The author thanks Helena Ahonen, 
Barbara Heikkinen, Mika Klemettinen, and Juha 
Karkkainen for their contributions to the work de- 
scribed. 

References 

J. P. Callan. 1994. Passage-level evidence in doc- 
ument retrieval. In Proc. SIGIR'94, Dublin, Ire- 
land. 

T. H. Cormen, C. E. Leiserson, and R. L. Rivest. 1990.
Introduction to Algorithms. MIT Press, Cambridge, MA,
USA.

M. A. Hearst. 1994. Multi-paragraph segmentation of
expository text. In Proc. ACL-94, Las Cruces, NM, USA.

M. A. Hearst. 1997. TextTiling: Segmenting text into
multi-paragraph subtopic passages. Computational
Linguistics, 23(1):33-64, March.

H. Kozima. 1993. Text segmentation based on similarity
between words. In Proc. ACL-93, Columbus, OH, USA.

J. Morris and G. Hirst. 1991. Lexical cohesion computed
by thesaural relation as an indicator of the structure
of text. Computational Linguistics, 17(1):21-48.

G. Salton. 1989. Automatic Text Processing: The
Transformation, Analysis, and Retrieval of Information
by Computer. Addison-Wesley, Reading, MA, USA.

R. Wilkinson and J. Zobel. 1995. Comparison of
fragmentation schemes for document retrieval. In
Overview of TREC-3, Gaithersburg, MD, USA.

Y. Yaari. 1997. Segmentation of expository texts by 
hierarchical agglomerative clustering. In Proc. 
RANLP'97, Tzigov Chark, Bulgaria. 

G. Youmans. 1991. A new tool for discourse anal- 
ysis. Language, 67(4):763-789. 



Errata 



Table 1 is incorrect. Table 2 is correct.
Figure 2 is tiny. Figure 4 has been enlarged.
Figure 3 here is additional.
Table 2: Variation of fragment length. Columns: l_avg, l_min, l_max: average, minimum, and maximum fragment
length; d_avg: average deviation.
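
The column quantities of Table 2 follow directly from the fragment lengths of a segmentation. A minimal Python sketch, using the definition of the average deviation d_avg given in Section 3; the function name length_stats is hypothetical and the input values in the note below are made up for illustration:

```python
def length_stats(fragment_lengths, p):
    """Return (l_avg, l_min, l_max, d_avg) for a list of fragment lengths.

    d_avg is the mean absolute deviation from the preferred length p.
    """
    m = len(fragment_lengths)
    if m == 0:
        raise ValueError("at least one fragment is required")
    l_avg = sum(fragment_lengths) / m
    d_avg = sum(abs(p - length) for length in fragment_lengths) / m
    return l_avg, min(fragment_lengths), max(fragment_lengths), d_avg
```

For example, fragments of 500, 700, and 600 words with p = 600 give l_avg = 600, l_min = 500, l_max = 700, and a d_avg of about 66.7 words.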
Figure 3: Similarity curve and detected fragment boundaries. Parabola cost function, p is 600 words, and
h = .25. Vertical bars indicate fragment boundaries, while short bars below the horizontal axis indicate
paragraph boundaries.
Figure 4: Similarity curve and detected fragment boundaries with different cost functions: (a) linear,
(b) parabola. p is 600 words in both (a) and (b). "H0.25", etc., indicates the value of h. Vertical bars
indicate fragment boundaries, while short bars below the horizontal axis indicate paragraph boundaries.



